This project is for the course Introduction to Open Data Science at the University of Helsinki, fall 2018. The learning objectives of the course are to understand the possibilities of open science, open data, and open research tools; to visualize and analyze data sets with some fairly advanced statistical methods; to master a few state-of-the-art, freely available, open software tools; to apply the principles of reproducible research and see their practical advantages; and to keep learning more of all this in the near and distant future (“life-long learning”). My GitHub repository can be found at: https://github.com/ellitaimela/IODS-project
The data analysis exercise for the second course week consisted of analysing students’ learning data from the course Introduction to Social Statistics, collected between December 2014 and January 2015. The tasks were to read and explore the structure of the dataset provided, to show a graphical overview and summaries of the data, and finally to create and analyse a regression model of the data.
I retrieved the data provided for this exercise from AWS. The data, formatted in a .txt file, contained students’ learning outcomes in the course Introduction to Social Statistics held in fall 2014. The data included 166 observations of 7 variables. The variables represented the students’ gender, age, and attitude towards the course, the students’ success in exam/exercise questions related to deep learning, strategic learning, and surface learning, and the exam points granted to the students.
# Reading the data
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
sep=",", header=TRUE)
# The data includes students' learning data from the course Introduction to Social Statistics,
# collected during 3.12.2014 - 10.1.2015
# Exploring the structure and dimensions of the data; 166 observations, of 7 variables
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166 7
# Summaries of the variables
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
I created a graphical overview of the distributions of the variables and their relationships. First, I boxplotted the whole dataset to see and compare the distributions of the variables. Second, I created a plot matrix with ggpairs to get an overall picture of the relationships between the variables. After this, I examined the distributions of the variables in more detail: I showed the distribution of students by gender with a barplot, and examined the distributions of the remaining variables with histograms. I also analyzed the relationships between the variables with scatterplots (qplot) and boxplots (plot).
library(ggplot2)
library(GGally)
# Plotting the distributions of the variables
boxplot(lrn14)
# Creating a plot matrix with ggpairs
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))
# 110 females, 56 males
barplot(table(lrn14$gender))
# Age distribution from 17 to 55, median 22, mean 25.51, standard deviation 7.766078
hist(lrn14$age)
# Attitude from 1 to 5, median 3.2, mean 3.14, standard deviation 0.5327656
hist(lrn14$attitude)
# Points of deep learning questions from 1 to 5, median 3.667, mean 3.680, sd 0.5541369
hist(lrn14$deep)
# Points of strategic learning questions from 1 to 5, median 3.188, mean 3.121, sd 0.7718318
hist(lrn14$stra)
# Points of surface learning questions from 1 to 5, median 2.833, mean 2.787, sd 0.5288405
hist(lrn14$surf)
# Points granted altogether from 7 to 33, median 23.00, mean 22.72, sd 5.894884
hist(lrn14$points)
# A positive correlation between attitude and points can be found
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")
qplot(age, points, data = lrn14) + geom_smooth(method = "lm")
qplot(surf, points, data = lrn14) + geom_smooth(method = "lm")
# There is no correlation between age and attitude
qplot(age, attitude, data = lrn14) + geom_smooth(method = "lm")
# The correlation between age and points achieved in strategic and deep questions looks minimal
qplot(age, stra, data = lrn14) + geom_smooth(method = "lm")
qplot(age, deep, data = lrn14) + geom_smooth(method = "lm")
# Some negative correlation can be seen between age and points achieved in surface questions - statistical significance not analyzed
qplot(age, surf, data = lrn14) + geom_smooth(method = "lm")
# On average, men achieved marginally higher points compared to women - statistical significance not analyzed
plot(lrn14$points ~ lrn14$gender)
# On average, women achieved higher points in strategic and surface learning questions - statistical significance not analyzed
plot(lrn14$stra ~ lrn14$gender)
plot(lrn14$deep ~ lrn14$gender)
plot(lrn14$surf ~ lrn14$gender)
I started building the regression model by choosing three variables - attitude, stra, and surf - as explanatory variables and fitting a regression model with exam points as the dependent variable. I chose these variables because they had the highest correlations with points, as could be seen in the ggpairs matrix.
# Fitting a linear model
model1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Of the explanatory variables, only attitude turned out to be statistically significant, with a p-value of 1.93e-08. The p-values of stra and surf were not statistically significant, so I excluded them from further analysis. I eventually tested combinations of all variables in the dataset, but found no statistically significant explanatory variable other than attitude. Therefore, I created a new regression model with attitude as the only explanatory variable. Statistically significant (p < 0.0001) estimates were found for both the intercept and attitude.
# Fitting a new model with attitude as the only explanatory variable
model2 <- lm(points ~ attitude, data = lrn14)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
As can be seen in the summary of the regression model above (model2), exam points are positively correlated with one’s attitude. This is intuitive, because a high (positive) attitude towards a topic typically means higher interest and motivation to learn, which in turn typically leads one to invest more time and effort in the topic. From the summary we can see that a one-unit increase in attitude (scale from 1 to 5) increases points by 3.5255 on average.
The multiple R-squared (coefficient of determination) describes how close the data are to the fitted regression line of the model. It takes values from 0 to 1 (0 to 100 %), giving the proportion of the variation in the target variable that the model explains. The multiple R-squared of model2 is 0.1906, which means the model explains only a minor part of the variation in exam points.
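As a sanity check of what multiple R-squared measures, here is a minimal self-contained sketch (using R’s built-in mtcars data rather than the course data) that computes it by hand from the residual and total sums of squares:

```r
# Fit a simple linear model on R's built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# R-squared = 1 - SSE / SST
sse <- sum(residuals(fit)^2)                      # residual sum of squares
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)     # total sum of squares
r2_manual <- 1 - sse / sst

# The manual value matches the one reported by summary()
r2_manual
summary(fit)$r.squared
```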
One natural assumption of the linear regression model created above is linearity. Another assumption related to linear regression models is that errors are normally distributed, they are not correlated, and have a constant variance. To analyze whether these assumptions are actually valid, we can take a closer look at the residuals of the model.
The normality of the errors can be analyzed with the Q-Q plot. As can be seen, a clear majority of the residuals follow the line; only the residuals at very low and very high values deviate from it. Therefore we can assume the error terms to be approximately normally distributed.
By plotting the residuals towards the fitted values of the model, we can see if there is any pattern that tells how the residuals change and deduce whether the variance of the residuals is constant. As we look at the Residuals vs. Fitted plot below, there does not seem to be a clear pattern. Therefore we can consider the assumption of a constant variance somewhat valid.
By examining the Residuals vs. Leverage plot below, we can also analyze the impact of single observations on the model. As we can see, no single observation stands out clearly, and the value of the leverages of all observations are low. Therefore we can reason that the impact of single observations on the model is not too high.
par(mfrow = c(2,2))
# Visualizing Residuals vs. Fitted values
plot(model2, which = 1)
# Visualizing a normal QQ-plot
plot(model2, which = 2)
# Visualizing Residuals vs. Leverage
plot(model2, which = 5)
The data analysis exercise for the third course week consisted of analysing the relationship between students’ alcohol consumption and learning outcomes.
A new markdown file ‘chapter3.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.
I read the data formulated in the Data Wrangling part of this week’s exercise, and checked the structure, dimensions and summary of the dataset. The data describes student achievement in secondary education at two Portuguese schools. The data attributes included student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets were provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)
# Setting the working directory
setwd("/Users/ellitaimela/Documents/Github/IODS-project/Data")
# Reading the data
alc <- read.csv("alc.csv", sep=",")
# The data describes student achievement in secondary education at two Portuguese schools.
# The data attributes include student grades, demographic, social and school related features,
# and it was collected by using school reports and questionnaires. Two datasets are provided
# regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
# Checking the data, and the structure, dimensions and summary of the dataset
# alc
str(alc)
## 'data.frame': 382 obs. of 36 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382 36
summary(alc)
## X school sex age address famsize
## Min. : 1.00 GP:342 F:198 Min. :15.00 R: 81 GT3:278
## 1st Qu.: 96.25 MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104
## Median :191.50 Median :17.00
## Mean :191.50 Mean :16.59
## 3rd Qu.:286.75 3rd Qu.:17.00
## Max. :382.00 Max. :22.00
## Pstatus Medu Fedu Mjob Fjob
## A: 38 Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## T:344 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
I chose 4 variables to study their relationships with high/low alcohol consumption. The variables I chose were famrel (quality of family relationships), goout (going out with friends), absences (number of school absences), and G3 (final grade). According to my initial hypothesis, students who have worse relationships with their family tend to feel worse and consume more alcohol. I also hypothesized that going out with friends and the number of school absences correlate positively with alcohol consumption, and that the final grade correlates negatively with alcohol consumption.
I drew boxplots of the distributions of the chosen variables, divided into high and low alcohol consumption. As can be seen, the results are in line with my initial hypotheses. The average grades are higher for students who consume less alcohol, and students who have better relationships with their family members consume less alcohol. Students who go out with friends more or have more absences also consume more alcohol. The numerical differences between the group means of the variables can also be seen below. However, statistical significance cannot be deduced from these results.
# initialize a plot of high_use and family relationships
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationships")
# initialize a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("going out")
# initialize a plot of high_use and absences
g3 <- ggplot(alc, aes(x = high_use, y = absences))
# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("absences")
# initialize a plot of high_use and G3
g4 <- ggplot(alc, aes(x = high_use, y = G3))
# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ylab("grade")
# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), famrel = mean(famrel), goout = mean(goout), absences = mean(absences), mean_grade = mean(G3))
## # A tibble: 2 x 6
## high_use count famrel goout absences mean_grade
## <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 268 4.00 2.85 3.71 11.7
## 2 TRUE 114 3.78 3.72 6.37 10.8
I created a logistic regression model m with the boolean variable high_use as the target variable; high_use is TRUE if a student’s alcohol consumption is high. The explanatory variables are the previously chosen variables: famrel, goout, absences, and G3. According to the results (see the output of summary(m)), going out and the number of absences correlate positively with high alcohol consumption very significantly (p<0.001). Family relationships (1 = bad, 5 = good) correlate negatively with alcohol consumption (p<0.05). The relationship between alcohol consumption and the final grade G3 was not statistically significant (p = 0.24).
I combined and printed out the odds ratios and their confidence intervals. As can be seen from the output below, the predictor goout has the widest confidence interval. The interval of G3 contains the value 1, with an odds ratio of 0.95 - therefore it is hard to determine whether G3 is positively or negatively associated with high_use. The odds ratios and confidence intervals show that goout has the strongest (positive) association with high_use. Famrel is negatively associated with high_use (the whole confidence interval is below 1). Absences are positively associated with high_use, but only slightly, since the confidence interval lies between 1.032 and 1.125.
# find the model with glm()
m <- glm(high_use ~ famrel + goout + absences + G3, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9389 -0.7743 -0.5392 0.9022 2.3966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.83044 0.78397 -2.335 0.019552 *
## famrel -0.34483 0.13563 -2.542 0.011010 *
## goout 0.75149 0.12033 6.245 4.23e-10 ***
## absences 0.07261 0.02174 3.340 0.000839 ***
## G3 -0.04482 0.03843 -1.166 0.243529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 394.43 on 377 degrees of freedom
## AIC: 404.43
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel goout absences G3
## -1.83043900 -0.34482682 0.75148543 0.07261377 -0.04481705
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1603432 0.0331851 0.724006
## famrel 0.7083430 0.5415610 0.923423
## goout 2.1201470 1.6848948 2.703401
## absences 1.0753151 1.0316028 1.124827
## G3 0.9561724 0.8866759 1.031252
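To recap where the odds ratios above come from, here is a minimal sketch with simulated data (not the course data): the odds ratio of a predictor is simply exp() of its logistic-regression coefficient.

```r
set.seed(42)
# Simulated data: binary outcome whose log-odds increase with x
d <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(0.8 * d$x))

fit <- glm(y ~ x, data = d, family = "binomial")

# exp() of a coefficient gives the odds ratio for a one-unit increase
or_x <- exp(coef(fit)["x"])
or_x  # above 1: higher x is associated with higher odds of y = 1
```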
I refined my model and excluded the final grade G3 from it because the relationship with alcohol consumption was not statistically significant.
As can be seen below from the 2x2 cross tabulation, 69 of the (244 + 69 =) 313 students who were predicted to be low users were actually high users of alcohol, and 24 of the (24 + 45 =) 69 students who were predicted to be high users were actually low users. Therefore (69 + 24) / 382 = 0.2434555, i.e. about 24.3 percent of the individuals were classified inaccurately. The same result can be obtained by building a loss function similar to the one in the DataCamp exercise. As can be seen, the prediction power is not perfect, but it is better than a simple, totally random guessing strategy, where the probability of guessing right between two options would be only 50 percent.
# find the model with glm()
m2 <- glm(high_use ~ famrel + goout + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m2)
##
## Call:
## glm(formula = high_use ~ famrel + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9018 -0.7731 -0.5466 0.9002 2.4180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.38825 0.63161 -3.781 0.000156 ***
## famrel -0.34986 0.13534 -2.585 0.009735 **
## goout 0.77071 0.11981 6.433 1.25e-10 ***
## absences 0.07446 0.02174 3.425 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 395.79 on 378 degrees of freedom
## AIC: 403.79
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m2)
## (Intercept) famrel goout absences
## -2.38824753 -0.34985642 0.77071403 0.07445633
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 24
## TRUE 69 45
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63874346 0.06282723 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.81937173 0.18062827 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2434555
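The same error rate can be checked directly from the confusion-matrix counts; a small self-contained sketch using the counts from the cross tabulation above:

```r
# 2x2 confusion matrix counts from the cross tabulation above
confusion <- matrix(c(244, 24,
                       69, 45),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(high_use   = c("FALSE", "TRUE"),
                                    prediction = c("FALSE", "TRUE")))

# Misclassified = false positives + false negatives
misclassified <- confusion["FALSE", "TRUE"] + confusion["TRUE", "FALSE"]
error_rate <- misclassified / sum(confusion)
error_rate  # 93 / 382 = 0.2434555
```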
According to a 10-fold cross-validation of the model m2, the prediction error with the test set gets values of around 0.23 to 0.25, which are smaller than the error of the model introduced in DataCamp (circa 0.26 error). The code and results can be seen below.
library(boot)
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486911
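The cross-validation error naturally varies a little from run to run because the folds are drawn at random. The same cv.glm pattern can be demonstrated with a self-contained sketch on simulated data (not the course data):

```r
library(boot)

set.seed(1)
# Simulated binary-outcome data
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))

fit <- glm(y ~ x, data = d, family = "binomial")

# Same loss function as above: proportion of misclassified observations
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the average prediction error
cv_err <- cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)$delta[1]
cv_err
```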
The data analysis exercise for the fourth course week consisted of analysing crime rates in the suburbs of Boston.
A new markdown file ‘chapter4.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.
I read the Boston data, which contains housing values in the suburbs of Boston. The columns include, for example, the per capita crime rate, the proportion of non-retail business acres per town, and the weighted mean of distances to five Boston employment centres. The dimensions of the data are 506 x 14.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(GGally)
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v readr 1.1.1 v stringr 1.3.1
## v tibble 1.4.2 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
# Loading the data
data(Boston)
# Checking the data, and the structure, dimensions and summary of the dataset
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
I created a graphical overview of the data. As can be seen from the ggpairs matrix and the summary statistics, the distributions vary between the variables. The distribution of crim, for example, resembles a Poisson distribution, and that of rm a normal distribution. Many of the distributions are skewed to the left or right.
The correlations between the variables seem logical. For example, the crime rate, the proportion of nearby industrial areas, the nitrogen oxides concentration, the high age of the buildings, the accessibility of radial highways, the property tax rate, the pupil-teacher ratio, and the lower status of the population all correlate negatively with the median value of owner-occupied homes within a town. The Charles River dummy variable has the smallest absolute correlations with the other variables.
# Graphical overview of the data
pairs(Boston)
ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
# Summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
I standardized the dataset with the scale function, which subtracts the column mean from each value and divides by the column standard deviation. After scaling, the mean of each variable is 0 and the standard deviation 1, while the shape of each distribution stays the same as in the original dataset. I also created a categorical crime-rate variable from the scaled crime rate in the Boston dataset, using the quantiles as the break points of the categorical variable. The old crime rate variable was then dropped from the dataset. Then I divided the dataset into train and test sets so that 80 percent of the data belonged to the train set and 20 percent to the test set.
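What scale() does can be verified on a toy vector: it subtracts the mean and divides by the standard deviation, leaving a variable with mean 0 and standard deviation 1.

```r
x <- c(2, 4, 6, 8, 10)
z <- as.vector(scale(x))

# scale() subtracts the mean and divides by the standard deviation
manual <- (x - mean(x)) / sd(x)
all.equal(z, manual)  # TRUE

# After scaling: mean 0, standard deviation 1
mean(z)
sd(z)
```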
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summary of the scaled dataset
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
label_vector <- c("low", "med_low", "med_high", "high")
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = label_vector)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove the original crime rate variable crim and add the categorical crime
boston_scaled <- subset(boston_scaled, select = -c(crim))
boston_scaled$crime <- crime
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
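Note that sample() is random, so the split (and all downstream results) changes every time the document is knitted. In the spirit of reproducible research, the split can be fixed with a seed; a minimal sketch (the seed value 123 is an arbitrary choice):

```r
# Fix the RNG seed so the 80/20 split is reproducible across knits
set.seed(123)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
```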
I fit a linear discriminant analysis (LDA) model on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors. After this, I drew the LDA biplot.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2500000 0.2524752 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 1.04057307 -0.9158927 -0.07933396 -0.8849085 0.5228532
## med_low -0.07969513 -0.2767872 -0.03844192 -0.5712067 -0.1312078
## med_high -0.38551208 0.1370886 0.26805724 0.3356399 0.1445756
## high -0.48724019 1.0171737 -0.03371693 1.0386422 -0.3698266
## age dis rad tax ptratio
## low -0.8949757 0.8783621 -0.7082658 -0.7567536 -0.523784528
## med_low -0.3107823 0.3748689 -0.5440892 -0.4646832 -0.006443321
## med_high 0.4252114 -0.3758448 -0.4132674 -0.3270467 -0.226263117
## high 0.7989934 -0.8340679 1.6375616 1.5136504 0.780117019
## black lstat medv
## low 0.3737511 -0.78426915 0.61981760
## med_low 0.3091010 -0.14065827 -0.02466708
## med_high 0.1065346 0.01940936 0.18958840
## high -0.7808641 0.86941597 -0.64595193
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0806145506 0.785768591 -0.90194935
## indus 0.0003166139 -0.129250049 0.31497574
## chas -0.0838745716 -0.081220574 0.04910057
## nox 0.4011851216 -0.716609258 -1.40016652
## rm -0.0946999313 -0.062982691 -0.22278107
## age 0.3010893364 -0.355532119 -0.04893558
## dis -0.0194243246 -0.363184403 0.25444871
## rad 3.1415001222 1.013436267 -0.08687102
## tax 0.0545009694 -0.144443911 0.65013377
## ptratio 0.0855368176 0.024988320 -0.25696948
## black -0.1169816299 -0.006244425 0.10238013
## lstat 0.2863303041 -0.246798853 0.11824671
## medv 0.2274576490 -0.321434628 -0.29175433
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9495 0.0367 0.0138
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
I saved the crime categories from the test set and removed the categorical crime variable from the test dataset. After this, I predicted the classes of the test data with the LDA model. Cross-tabulating the predictions against the correct crime categories shows that the model is better at predicting high crime rates than low ones: all 28 high-rate observations are classified correctly, whereas about half of the low-rate observations are misclassified.
# save the correct classes from test data
correct_classes <- test$crime
test <- subset(test, select = -c(crime))
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
str(lda.pred)
## List of 3
## $ class : Factor w/ 4 levels "low","med_low",..: 2 2 3 3 2 2 2 2 1 1 ...
## $ posterior: num [1:102, 1:4] 0.20532 0.10012 0.01338 0.00623 0.09736 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "2" "7" "8" "9" ...
## .. ..$ : chr [1:4] "low" "med_low" "med_high" "high"
## $ x : num [1:102, 1:3] -3.182 -1.862 -1.2 -0.903 -1.983 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "2" "7" "8" "9" ...
## .. ..$ : chr [1:3] "LD1" "LD2" "LD3"
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 12 0 0
## med_low 5 15 5 0
## med_high 0 6 17 1
## high 0 0 0 28
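The overall prediction accuracy can be read from the same cross-tabulation by summing the diagonal. A small sketch (with the table above this gives 73/102, roughly 0.72, but the exact value depends on the random train/test split):

```r
# Proportion of test observations classified correctly
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```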
I reloaded the Boston dataset, standardized it, and calculated Euclidean and Manhattan distance matrices between the observations. I then ran the k-means algorithm with three clusters. Visualizing the clusters with pairs plots suggests that three clusters separate the observations reasonably well.
# load MASS and Boston
library(MASS)
data('Boston')
boston_scaled.new <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(boston_scaled.new)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled.new, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering on the standardized data (k-means uses Euclidean
# distances, so unscaled variables would dominate the clustering)
km <- kmeans(boston_scaled.new, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
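The choice of three clusters can be checked with the total within-cluster sum of squares (WCSS): a sharp bend ("elbow") in the WCSS curve suggests a suitable number of clusters. A sketch using the scaled data created above (the seed is an arbitrary choice):

```r
set.seed(123)
# Total WCSS for k = 1..10 clusters on the standardized data
twcss <- sapply(1:10, function(k) kmeans(boston_scaled.new, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```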
In the final week we analysed longitudinal data. We started with graphical displays and the summary measure approach, and continued with linear mixed effects models for normal response variables.
Data Wrangling
The code for data wrangling can be found at my Github repository. The code without prints or graphs is implemented below to enable further analysis.
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",
sep=" ", header=TRUE)
rats <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
sep="\t", header=TRUE)
# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
# Convert to long form
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
# Extract the week number
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks, 5, 5)))
# Convert to long form
ratsl <- rats %>% gather(key = WDs, value = weight, -ID, -Group)
# Extract the wd
ratsl <- ratsl %>% mutate(WD = as.integer(substr(WDs, 3, 4)))
Part I: Graphical Displays and Summary Measure Approach
The data included observations of 16 subjects (rats) divided into 3 groups. With the data in long form, we can analyse the individual response profiles by group by building a ggplot that visualizes the responses of the individual subjects across time. We can see that the rats in Group 1 (n=8) have lower weights than those in Groups 2 (n=4) and 3 (n=4). The weights in Group 2 have the highest variance: one rat weighs considerably more than the others.
ratsl
## ID Group WDs weight WD
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
## 7 7 1 WD1 275 1
## 8 8 1 WD1 245 1
## 9 9 2 WD1 410 1
## 10 10 2 WD1 405 1
## 11 11 2 WD1 445 1
## 12 12 2 WD1 555 1
## 13 13 3 WD1 470 1
## 14 14 3 WD1 535 1
## 15 15 3 WD1 520 1
## 16 16 3 WD1 510 1
## 17 1 1 WD8 250 8
## 18 2 1 WD8 230 8
## 19 3 1 WD8 250 8
## 20 4 1 WD8 255 8
## 21 5 1 WD8 260 8
## 22 6 1 WD8 265 8
## 23 7 1 WD8 275 8
## 24 8 1 WD8 255 8
## 25 9 2 WD8 415 8
## 26 10 2 WD8 420 8
## 27 11 2 WD8 445 8
## 28 12 2 WD8 560 8
## 29 13 3 WD8 465 8
## 30 14 3 WD8 525 8
## 31 15 3 WD8 525 8
## 32 16 3 WD8 510 8
## 33 1 1 WD15 255 15
## 34 2 1 WD15 230 15
## 35 3 1 WD15 250 15
## 36 4 1 WD15 255 15
## 37 5 1 WD15 255 15
## 38 6 1 WD15 270 15
## 39 7 1 WD15 260 15
## 40 8 1 WD15 260 15
## 41 9 2 WD15 425 15
## 42 10 2 WD15 430 15
## 43 11 2 WD15 450 15
## 44 12 2 WD15 565 15
## 45 13 3 WD15 475 15
## 46 14 3 WD15 530 15
## 47 15 3 WD15 530 15
## 48 16 3 WD15 520 15
## 49 1 1 WD22 260 22
## 50 2 1 WD22 232 22
## 51 3 1 WD22 255 22
## 52 4 1 WD22 265 22
## 53 5 1 WD22 270 22
## 54 6 1 WD22 275 22
## 55 7 1 WD22 270 22
## 56 8 1 WD22 268 22
## 57 9 2 WD22 428 22
## 58 10 2 WD22 440 22
## 59 11 2 WD22 452 22
## 60 12 2 WD22 580 22
## 61 13 3 WD22 485 22
## 62 14 3 WD22 533 22
## 63 15 3 WD22 540 22
## 64 16 3 WD22 515 22
## 65 1 1 WD29 262 29
## 66 2 1 WD29 240 29
## 67 3 1 WD29 262 29
## 68 4 1 WD29 265 29
## 69 5 1 WD29 270 29
## 70 6 1 WD29 275 29
## 71 7 1 WD29 273 29
## 72 8 1 WD29 270 29
## 73 9 2 WD29 438 29
## 74 10 2 WD29 448 29
## 75 11 2 WD29 455 29
## 76 12 2 WD29 590 29
## 77 13 3 WD29 487 29
## 78 14 3 WD29 535 29
## 79 15 3 WD29 543 29
## 80 16 3 WD29 530 29
## 81 1 1 WD36 258 36
## 82 2 1 WD36 240 36
## 83 3 1 WD36 265 36
## 84 4 1 WD36 268 36
## 85 5 1 WD36 273 36
## 86 6 1 WD36 277 36
## 87 7 1 WD36 274 36
## 88 8 1 WD36 265 36
## 89 9 2 WD36 443 36
## 90 10 2 WD36 460 36
## 91 11 2 WD36 455 36
## 92 12 2 WD36 597 36
## 93 13 3 WD36 493 36
## 94 14 3 WD36 540 36
## 95 15 3 WD36 546 36
## 96 16 3 WD36 538 36
## 97 1 1 WD43 266 43
## 98 2 1 WD43 243 43
## 99 3 1 WD43 267 43
## 100 4 1 WD43 270 43
## 101 5 1 WD43 274 43
## 102 6 1 WD43 278 43
## 103 7 1 WD43 276 43
## 104 8 1 WD43 265 43
## 105 9 2 WD43 442 43
## 106 10 2 WD43 458 43
## 107 11 2 WD43 451 43
## 108 12 2 WD43 595 43
## 109 13 3 WD43 493 43
## 110 14 3 WD43 525 43
## 111 15 3 WD43 538 43
## 112 16 3 WD43 535 43
## 113 1 1 WD44 266 44
## 114 2 1 WD44 244 44
## 115 3 1 WD44 267 44
## 116 4 1 WD44 272 44
## 117 5 1 WD44 273 44
## 118 6 1 WD44 278 44
## 119 7 1 WD44 271 44
## 120 8 1 WD44 267 44
## 121 9 2 WD44 446 44
## 122 10 2 WD44 464 44
## 123 11 2 WD44 450 44
## 124 12 2 WD44 595 44
## 125 13 3 WD44 504 44
## 126 14 3 WD44 530 44
## 127 15 3 WD44 544 44
## 128 16 3 WD44 542 44
## 129 1 1 WD50 265 50
## 130 2 1 WD50 238 50
## 131 3 1 WD50 264 50
## 132 4 1 WD50 274 50
## 133 5 1 WD50 276 50
## 134 6 1 WD50 284 50
## 135 7 1 WD50 282 50
## 136 8 1 WD50 273 50
## 137 9 2 WD50 456 50
## 138 10 2 WD50 475 50
## 139 11 2 WD50 462 50
## 140 12 2 WD50 612 50
## 141 13 3 WD50 507 50
## 142 14 3 WD50 543 50
## 143 15 3 WD50 553 50
## 144 16 3 WD50 550 50
## 145 1 1 WD57 272 57
## 146 2 1 WD57 247 57
## 147 3 1 WD57 268 57
## 148 4 1 WD57 273 57
## 149 5 1 WD57 278 57
## 150 6 1 WD57 279 57
## 151 7 1 WD57 281 57
## 152 8 1 WD57 274 57
## 153 9 2 WD57 468 57
## 154 10 2 WD57 484 57
## 155 11 2 WD57 466 57
## 156 12 2 WD57 618 57
## 157 13 3 WD57 518 57
## 158 14 3 WD57 544 57
## 159 15 3 WD57 555 57
## 160 16 3 WD57 553 57
## 161 1 1 WD64 278 64
## 162 2 1 WD64 245 64
## 163 3 1 WD64 269 64
## 164 4 1 WD64 275 64
## 165 5 1 WD64 280 64
## 166 6 1 WD64 281 64
## 167 7 1 WD64 284 64
## 168 8 1 WD64 278 64
## 169 9 2 WD64 478 64
## 170 10 2 WD64 496 64
## 171 11 2 WD64 472 64
## 172 12 2 WD64 628 64
## 173 13 3 WD64 525 64
## 174 14 3 WD64 559 64
## 175 15 3 WD64 548 64
## 176 16 3 WD64 569 64
dim(rats)
## [1] 16 13
summary(rats)
## ID Group WD1 WD8 WD15
## 1 : 1 1:8 Min. :225.0 Min. :230.0 Min. :230.0
## 2 : 1 2:4 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## 3 : 1 3:4 Median :340.0 Median :345.0 Median :347.5
## 4 : 1 Mean :365.9 Mean :369.1 Mean :372.5
## 5 : 1 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## 6 : 1 Max. :555.0 Max. :560.0 Max. :565.0
## (Other):10
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
##
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
##
dim(ratsl)
## [1] 176 5
names(ratsl)
## [1] "ID" "Group" "WDs" "weight" "WD"
str(ratsl)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WDs : chr "WD1" "WD1" "WD1" "WD1" ...
## $ weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD : int 1 1 1 1 1 1 1 1 1 1 ...
summary(ratsl)
## ID Group WDs weight WD
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
ggplot(ratsl, aes(x = WD, y = weight, linetype = ID)) + geom_line() +
scale_linetype_manual(values = rep(1:10, times = 4)) + facet_grid(.~Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(ratsl$weight), max(ratsl$weight)))
We can take tracking into account by standardizing the data and plotting the standardized responses. Standardization makes comparing individuals within and between groups easier. The plot below shows that the same individuals stand out as before.
The individuals in Group 1 seem to gain weight quite steeply compared to at least Group 2. One individual stands out whose weight drops dramatically at the beginning.
The individuals in Group 2 do not gain as much weight as the other groups; however, their growth is steadier.
The individuals in Group 3 gain more weight than Group 2 and more steadily than the individuals in Group 1. All individuals in this group gain weight.
# Standardized weight plots by groups
ratsl <- ratsl %>% group_by(Group) %>% mutate(weights_standardized = (weight - mean(weight))/sd(weight)) %>% ungroup()
ggplot(ratsl, aes(x = WD, y = weights_standardized, linetype = ID)) + geom_line() +
scale_linetype_manual(values = rep(1:10, times = 4)) + facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
Next we analyse the average response profiles of each group. In the plot below, the standard error of the mean indicates the variability around each group mean. We can see that the mean profiles and their error bars do not overlap. The mean starting weights of the groups differ, with Groups 2 and 3 closer to each other, as mentioned before. As noted, Group 2 has the highest variance and Group 1 the smallest. The average weight grew the most in Group 2.
# Summary measure analysis of longitudinal data
n <- ratsl$WD %>% unique() %>% length()
ratss <- ratsl %>% group_by(Group, WD) %>% summarise( mean = mean(weight), se = sd(weight)/sqrt(n)) %>% ungroup()
# glimpse(ratss)
# Plot the mean profiles
ggplot(ratss, aes(x = WD, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
We can also compare the average weights of the groups with the boxplots below. The first-day response/observation was removed from the data. The boxplots show similar results to those discussed above; however, we can now see more clearly that each group contains an outlier.
# Create a summary data by treatment and subject with mean as the summary variable.
ratsl8s <- ratsl %>%
filter(WD > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(weight)) %>%
ungroup()
# Glimpse the data
# glimpse(ratsl8s)
# Draw a boxplot of the mean versus group
ggplot(ratsl8s, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), time 0-60")
When we filter out the outliers, the variance decreases in each group. For Group 1 the mean grows a bit (from around 265 to 275), for Group 2 it decreases from around 490 to 450, and for Group 3 it increases from around 530 to 540. Now we can see more clearly that the weights of the individuals in Groups 2 and 3 are on different levels.
# Filtering the outliers
ratsl8s_1 <- ratsl8s %>% filter(Group == 3)
ratsl8s_1 <- ratsl8s_1 %>% filter(mean > 500)
ratsl8s_2 <- ratsl8s %>% filter(Group !=3)
ratsl8s_2 <- ratsl8s_2 %>% filter(mean>250) %>% filter(mean < 500)
ratsl8s_filtered <- rbind(ratsl8s_2, ratsl8s_1)
ggplot(ratsl8s_filtered, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(weight), whole study period")
To analyse the differences between the groups in more detail, we perform simple two-sided t-tests. Because a t-test can compare only two groups at a time, we run it on the three possible pairs of groups.
We can see that Group 1 differs from both Group 2 and Group 3: the p-values are well below the 0.05 significance level. However, the p-value for the difference between Groups 2 and 3 is not significant (p = 0.3263). Therefore we cannot conclude that the mean weights of Groups 2 and 3 differ.
ratsl8s12 <- ratsl8s %>% filter(Group != 3)
t.test(mean ~ Group, data = ratsl8s12, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -9.0646, df = 10, p-value = 3.88e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -277.5345 -168.0155
## sample estimates:
## mean in group 1 mean in group 2
## 265.025 487.800
ratsl8s23 <- ratsl8s %>% filter(Group != 1)
t.test(mean ~ Group, data = ratsl8s23, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -1.0686, df = 6, p-value = 0.3263
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -130.60428 51.20428
## sample estimates:
## mean in group 2 mean in group 3
## 487.8 527.5
ratsl8s13 <- ratsl8s %>% filter(Group != 2)
t.test(mean ~ Group, data = ratsl8s13, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3
## 265.025 527.500
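Running three separate t-tests inflates the overall Type I error rate. As a check, the same comparisons can be done in one call with a multiple-testing correction; a sketch using a Bonferroni adjustment (pool.sd = FALSE makes each comparison a separate Welch-type test):

```r
# Pairwise t-tests of mean weight between groups, Bonferroni-adjusted p-values
pairwise.t.test(ratsl8s$mean, ratsl8s$Group,
                p.adjust.method = "bonferroni", pool.sd = FALSE)
```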
Part II: Linear Mixed Effects Models for Normal Response Variables
Now we move our focus to the BPRS data. 40 individuals have been randomized to treatment group 1 (n=20) or treatment group 2 (n=20), and we observe the change in each individual's bprs value over nine weekly measurements (weeks 0 to 8).
BPRS
## treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1 1 1 42 36 36 43 41 40 38 47 51
## 2 1 2 58 68 61 55 43 34 28 28 28
## 3 1 3 54 55 41 38 43 28 29 25 24
## 4 1 4 55 77 49 54 56 50 47 42 46
## 5 1 5 72 75 72 65 50 39 32 38 32
## 6 1 6 48 43 41 38 36 29 33 27 25
## 7 1 7 71 61 47 30 27 40 30 31 31
## 8 1 8 30 36 38 38 31 26 26 25 24
## 9 1 9 41 43 39 35 28 22 20 23 21
## 10 1 10 57 51 51 55 53 43 43 39 32
## 11 1 11 30 34 34 41 36 36 38 36 36
## 12 1 12 55 52 49 54 48 43 37 36 31
## 13 1 13 36 32 36 31 25 25 21 19 22
## 14 1 14 38 35 36 34 25 27 25 26 26
## 15 1 15 66 68 65 49 36 32 27 30 37
## 16 1 16 41 35 45 42 31 31 29 26 30
## 17 1 17 45 38 46 38 40 33 27 31 27
## 18 1 18 39 35 27 25 29 28 21 25 20
## 19 1 19 24 28 31 28 29 21 22 23 22
## 20 1 20 38 34 27 25 25 27 21 19 21
## 21 2 1 52 73 42 41 39 38 43 62 50
## 22 2 2 30 23 32 24 20 20 19 18 20
## 23 2 3 65 31 33 28 22 25 24 31 32
## 24 2 4 37 31 27 31 31 26 24 26 23
## 25 2 5 59 67 58 61 49 38 37 36 35
## 26 2 6 30 33 37 33 28 26 27 23 21
## 27 2 7 69 52 41 33 34 37 37 38 35
## 28 2 8 62 54 49 39 55 51 55 59 66
## 29 2 9 38 40 38 27 31 24 22 21 21
## 30 2 10 65 44 31 34 39 34 41 42 39
## 31 2 11 78 95 75 76 66 64 64 60 75
## 32 2 12 38 41 36 27 29 27 21 22 23
## 33 2 13 63 65 60 53 52 32 37 52 28
## 34 2 14 40 37 31 38 35 30 33 30 27
## 35 2 15 40 36 55 55 42 30 26 30 37
## 36 2 16 54 45 35 27 25 22 22 22 22
## 37 2 17 33 41 30 32 46 43 43 43 43
## 38 2 18 28 30 29 33 30 26 36 33 30
## 39 2 19 52 43 26 27 24 32 21 21 21
## 40 2 20 47 36 32 29 25 23 23 23 23
dim(BPRS)
## [1] 40 11
summary(BPRS)
## treatment subject week0 week1 week2
## 1:20 1 : 2 Min. :24.00 Min. :23.00 Min. :26.0
## 2:20 2 : 2 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## 3 : 2 Median :46.00 Median :41.00 Median :38.0
## 4 : 2 Mean :48.00 Mean :46.33 Mean :41.7
## 5 : 2 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## 6 : 2 Max. :78.00 Max. :95.00 Max. :75.0
## (Other):28
## week3 week4 week5 week6
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75
## Median :36.50 Median :34.50 Median :30.50 Median :28.50
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00
##
## week7 week8
## Min. :18.0 Min. :20.00
## 1st Qu.:23.0 1st Qu.:22.75
## Median :30.0 Median :28.00
## Mean :32.2 Mean :31.43
## 3rd Qu.:38.0 3rd Qu.:35.25
## Max. :62.0 Max. :75.00
##
dim(BPRSL)
## [1] 360 5
names(BPRSL)
## [1] "treatment" "subject" "weeks" "bprs" "week"
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
We first take a look at the data by plotting the bprs values against week. We do not draw the lines, but identify the treatment group of each observation. The individuals with the highest values belong to group 2 in almost every week; otherwise the responses of the two groups overlap strongly. There appears to be an overall pattern of bprs values decreasing across time in both groups. There also seems to be a small increase in the bprs values during the last week, but we cannot yet say whether this increase is real.
ggplot(BPRSL, aes(x = week, y = bprs)) + geom_text(aes(label = treatment)) + scale_x_continuous(name = "Week") + scale_y_continuous(name = "BPRS") + theme(legend.position = "top")
Analysing the individual response profiles by treatment group does not reveal anything new: the groups do not seem very different, apart from the observation already made above (the individuals with the highest values belong to group 2 in almost every week).
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Now we analyse the differences between the groups by building a linear regression model with week, treatment, and the week*treatment interaction as explanatory variables. Only week turned out statistically significant. Based on these results, we can hypothesise that the effects of the two treatments did not differ from each other. Note, however, that this model ignores the repeated-measures structure of the data: observations of the same subject are not independent.
# building a linear regression model
BPRS_model <- lm(data = BPRSL, bprs ~ week + treatment + week*treatment)
summary(BPRS_model)
##
## Call:
## lm(formula = bprs ~ week + treatment + week * treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.886 -9.267 -2.801 6.513 51.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.8856 1.6970 28.218 < 2e-16 ***
## week -2.6283 0.3564 -7.374 1.17e-12 ***
## treatment2 -2.2911 2.3999 -0.955 0.340
## week:treatment2 0.7158 0.5041 1.420 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.35 on 356 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1829
## F-statistic: 27.78 on 3 and 356 DF, p-value: 3.629e-16
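The same conclusion can be read from the coefficient confidence intervals: only the interval for week excludes zero. A quick check:

```r
# 95% confidence intervals for the regression coefficients
confint(BPRS_model)
```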
Now we will take a look at a more appropriate model for this kind of data, the random intercept model. We use week and treatment as fixed-effect terms and (1 | subject) as the random-effect term, which allows each subject to have its own intercept.
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Next we fit the random intercept and random slope model: week and treatment as fixed effects, and (week | subject) as the random-effect term, allowing each subject its own intercept and its own slope over time.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8202 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
An ANOVA (likelihood ratio test) between the two models shows that the random intercept and random slope model fits better: it has a lower AIC and the chi-squared test is significant (p = 0.026).
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
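A natural further step (not run here) would be to add a week*treatment interaction to the better-fitting model and test whether it improves the fit; a sketch:

```r
# Random intercept and slope model with a week-by-treatment interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# Likelihood ratio test against the model without the interaction
anova(BPRS_ref2, BPRS_ref1)
```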